The Internet, it turns out, is NOT a microcosm for society.
NOTICE: This website is currently a WORK-IN-PROGRESS. I am still adding analysis and data, which is why it might look a bit funky towards the bottom of this page. There will also be spelling/grammar mistakes, as well as bad wording (I have not proofread any of my writing). But something is always better than nothing, and I really needed to double check that GitHub Pages works… I’m learning from my previous mistakes… Welcome to the slow process of analysis, and enjoy your stay (while it lasts)!
This is an analysis of Big Five results, with raw data collected from Open Psychometrics. It contains over 1 million results from various countries, making it quite a large file to read through.
My original goal for this project was to dabble in a bit of data science, since data genuinely fascinates me. So, learning how to parse, clean, and modify it is always going to be a fun endeavor. However, I wanted to drag a bit of psychology in here, just to make it more engaging for me!
If you’re interested in a specific part, you may skip ahead:
Introduction
Data
Further Analysis
After looking at the raw data, there were several questions I wanted to solve:
The first three are purely data-based, as they’re just an exploration of the quiz answers. Number four, on the other hand, is trying to understand WHY the results are the way they are. Personally, I find that it’s the most interesting aspect of this: The results are cool, but understanding the people behind the quiz is far more entertaining.
However, I’m also looking to learn more about the Big Five as a personality model, with a specific interest in common trends and future applications. This means that the second half of this investigation will focus on psychology rather than data science, with a heavier emphasis on writing than code. Furthermore, the tone will remain casual throughout the investigation, and I will frequently use personal pronouns and other scientific no-gos. It’s not an academic paper for a reason!
The Big Five is a personality model that categorizes people using 5 major personality traits
Hence, a possible result could look like RLOEN, standing for reserved, limbic, organized, egocentric and non-inquisitive. This can easily be seen by looking at the first letter of each trait. However, an ‘X’ might also be seen. This symbolizes that the person equally represents both sides of a trait. Therefore, there is no conclusive answer, resulting in an ‘X’, such as SXUAN.
The specific order and formatting of these traits follow the SLOAN format, and this investigation will continue using this syntax. However, other common formats include OCEAN and CANOE. The specific wording (e.g. Extroversion, Conscientiousness) of traits may also vary from one investigation to another. This investigation will continue using the wording above.
These five major traits can also be broken down further: Extroversion can also contain traits like gregariousness and excitement-seeking. However, these subsets are not within the scope of this investigation.
Furthermore, it should be noted that one person is not entirely one or the other. People exist on a spectrum of these results, despite how black and white they seem to be portrayed. The Big Five result tends to
Interested in what your results are? The Open Psychometrics Big Five quiz can be taken here.
Citations
This quiz is self-administered, with users given 50 questions, meaning 10 for each trait. Every question was answered on a 5-point Likert scale, with 1 meaning I disagree with the prompt, 3 meaning I am neutral on the prompt, and 5 meaning I agree with the prompt.
Some sample prompts include:
Questions from each category were rotated, starting from Extroversion Q1, Agreeableness Q1, Conscientiousness Q1, Neuroticism Q1, Openness Q1, Extroversion Q2, etc.
Data was also collected on the country of the quiz-taker, the amount of time spent on each question, date taken, and more. In total, 1,015,342 answers were collected over ~2 years, with consent from the user.
Unfortunately, within this data set, Open Psychometrics did not provide us with specific results (e.g. SCUEI), but with a bunch of numbers. In fact, this is what the data looks like:
I’ve only selected the first 3 columns of the first 6 responses. Not too bad, right? But, you’ve gotta imagine about 50 more columns, and about a million more rows. It’s a little more intimidating now, but we’ve gotta start somewhere >:)
First, let’s calculate the scores for the Extroversion personality trait. We can do this by summing up the scores of the 10 EXT (standing for EXTroversion) questions. Of course, it should be noted that answering a ‘4’ or ‘5’ on an EXT question doesn’t always mean that a user is more extroverted. For example, compare EXT Question 1 (“I am the life of the party”) with EXT Question 2 (“I don’t talk a lot”): agreeing with the first points towards extroversion, while agreeing with the second points towards introversion.
So, we’ll need to make sure to add and subtract scores as necessary - I’ve arbitrarily made the decision to add ‘points’ if a question symbolizes extroversion, while subtracting ‘points’ if a question is correlated with introversion.
Hence, after adding up all the scores, any positive score (> 0) will mean that the user receives an ‘S’ for Sociable, while a negative score (< 0) would result in an ‘R’. If their score is 0, then they’d receive an ‘X’, as their results can’t really be evaluated since they’re perfectly in-between!
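The add-and-subtract scoring described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used for this analysis: the function names are made up, and the `EXT_KEY` pattern is an assumption (the text only confirms that Question 1 points towards extroversion and Question 2 towards introversion).

```python
# Hypothetical keying for the 10 EXT questions: +1 if agreeing signals
# extroversion, -1 if agreeing signals introversion. Only Q1 (+1) and
# Q2 (-1) come from the text; the alternating pattern is an assumption.
EXT_KEY = [+1, -1, +1, -1, +1, -1, +1, -1, +1, -1]


def trait_score(answers, key):
    """Add answers to 'positive' questions and subtract the rest."""
    if len(answers) != len(key):
        raise ValueError("need exactly one answer per question")
    return sum(sign * answer for sign, answer in zip(key, answers))


def trait_letter(score, positive, negative):
    """Map a score to a letter: > 0, < 0, or 'X' for a perfect tie."""
    if score > 0:
        return positive
    if score < 0:
        return negative
    return "X"


# One made-up respondent (answers are on the 1-5 Likert scale).
answers = [4, 2, 5, 1, 3, 2, 4, 3, 5, 2]
score = trait_score(answers, EXT_KEY)
letter = trait_letter(score, "S", "R")
print(score, letter)
```

With the made-up answers above, the positive questions add up to 21 and the negative ones to 10, so the score is 11 and the respondent gets an ‘S’.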
Sample table of results:
Visualizing this data into a graph:
I’m going to refrain from any analysis on WHY there’s more R’s compared to S’s until the final results. But, it’s a pretty 50/50 split!
The same process will be applied to the other personality traits1, with the graphs of results located below:
After getting these values, we can now combine them together to get the results. I’m also going to limit this to the top 25 results… or else the data just gets super messy.
WOOO! Congratulations to the XXOAI family for being among the most common results out of 243 possibilities! If you’d prefer to see a more numerical visual, I’ve provided a table containing the 10 most common results below:
| Results | Total Amount | Percentage of People |
|---|---|---|
| SCOAI | 160758 | 15.832892 |
| RLOAI | 132305 | 13.030585 |
| RCOAI | 106355 | 10.474796 |
| RLUAI | 98833 | 9.733962 |
| SLOAI | 96998 | 9.553234 |
| SLUAI | 72098 | 7.100859 |
| SCUAI | 54528 | 5.370407 |
| RCUAI | 40040 | 3.943499 |
| RLXAI | 14370 | 1.415287 |
| SXOAI | 11818 | 1.163943 |
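For anyone curious how a table like this could be built, here is a rough sketch in Python. It assumes each person’s five letters have already been calculated; the names `OPTIONS`, `ALL_TYPES` and `top_types` are made up for illustration, and the sample data is invented. It also shows where the 243 comes from: 3 possible letters for each of the 5 traits.

```python
from collections import Counter
from itertools import product

# Possible letters per trait, in SLOAN order, plus 'X' for a tie.
OPTIONS = ["SRX", "CLX", "OUX", "AEX", "INX"]

# Every possible result: 3 choices for each of 5 traits = 3**5 = 243.
ALL_TYPES = ["".join(letters) for letters in product(*OPTIONS)]


def top_types(per_person_letters, n=10):
    """Return (result, total amount, percentage of people) for the n most common results."""
    codes = ["".join(letters) for letters in per_person_letters]
    total = len(codes)
    return [
        (code, count, 100 * count / total)
        for code, count in Counter(codes).most_common(n)
    ]


# Four invented respondents, each as a tuple of five letters.
sample = [
    ("S", "C", "O", "A", "I"),
    ("S", "C", "O", "A", "I"),
    ("R", "L", "O", "A", "I"),
    ("R", "L", "U", "A", "I"),
]

print(len(ALL_TYPES))
for code, count, percent in top_types(sample, n=2):
    print(code, count, percent)
```

The percentage column in the table follows the same arithmetic: 160758 SCOAI results out of 1,015,342 responses is about 15.83%.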
Let’s try to analyze these results.
Extroversion
There seems to be a nice mix of S(ocial) and R(eserved) values, which is representative of the larger population. It’s the most balanced compared to the other 4 traits, and actually mirrors what is expected. This is likely because it’s quite easy for people to decide if they’re ‘introverted’ or ‘extroverted’, and the questions (e.g. “I start conversations”) tended to be quite straightforward and describe experiences that they would’ve gone through in their daily life.
When looking at TABLENUM, there are technically about 40000 more R’s than S’s. If I wanted to stereotype (which is an inevitable part of ‘absolute letters’, rather than 70% R vs 30% S), maybe the S’s aren’t as likely to spend 20 minutes on an online test, compared to R’s, who are more likely to be holed up in their room.
Neuroticism
Once again, a solid sample of results between C(alm) and L(imbic). I think it’s similar to extroversion, where it’s quite easy to determine how strongly you feel your emotions, and your ability to control them.
Scrolling up to TABLENUM, there are more limbic people, compared to calm, by about 60000. I’m not really surprised. In fact, I’d have thought there’d be even more limbic people, as society slowly puts more emphasis on mental health and actually exploring our emotions. This could make us more moody, since we actually try to work through our emotions, rather than repressing them and pretending to be calm. However, it could also be the other way around! By engaging with our emotions, we can better understand ourselves, and lead a life where we are calmer since we build control!
Conscientiousness
Same as above, for O(rganized) and U(nstructured). Relatively simple to identify, applicable in daily life.
There’s a lot more O’s than U’s, which I’m also surprised about? There’s also the most X’s in TABLENUM, compared to others. The X’s, I think, can be explained since people fluctuate between having super organized lives, but a messy desk? It’s easy to be both organized and unorganized. As for the large amount of O’s… It’s clearly not represented as much in TABLENUM, considering there’s basically an equal number of each response. It could be because people want to see themselves as organized, or they think they’re organized (e.g. they know where everything is on their desk, but items are actually strewn all over), when in reality, they aren’t! A good thought to carry for the next two personality traits.
Agreeableness
This is where it gets a bit funny! There is not a SINGLE E(gocentric) - it’s just a sea of A(greeable). This is probably because this is a self-administered test, people want to be seen as agreeable in society (correlated with ‘niceness’), and people are biased.
Let me tell you a story that I heard from a TED Talk. There’s this guy, I think he’s a magician, and he’s waiting at the airport. Something bad has happened that I can’t recall: maybe his flight has been delayed by 2 hours, or he really needs a refund, or he’s lost his luggage. So, he calls the airline, and the customer representative is clearly in a bad mood, maybe a little bit angry or grouchy, and he’s getting nowhere with his request. The guy hears her voice, and realizes: “Oh, the representative must think that I’M so lucky (despite how she sounds) that I’M talking to her, because she’s taking SO MUCH TIME out of HER day to help me.” So, the guy goes “Hey! I really appreciate you for helping me, and I’m really grateful that I’m talking to you!” The representative is INSTANTLY in a better mood and becomes genuinely helpful. She ends up solving his request after he acknowledges her struggles and is actually kind to her, since she’s far more used to being yelled at!
So, aside from being a heart-warming story, it also serves to prove two things:
So, it makes sense why, if we’re inputting our own answers without requiring proof, people can naturally skew themselves into thinking that they’re nicer than they actually are. I think the same concept applies to agreeability: we like to think that we’re agreeable, because it’s correlated with niceness, and we like being nice. Not only that, but we want to be liked, and oftentimes, by agreeing with others, people have a nicer opinion of us! Lastly, it’s a lot easier to agree with others than to go against the status quo. Maybe people do disagree, but they’ve just repressed their own thoughts because… society.2 The opposite also applies to egocentrism: nobody wants to call themselves egocentric, as it’s associated with narcissism and self-centeredness. They’re traits that nobody wants. Not only that, but who wants to admit that they agree with “I feel little concern for others”?
Openness
It’s anti-egocentrism part 2, except this time, people are refusing to call themselves non-inquisitive! In fact, it’s even worse than before, if you look at TABLENUM! Once again, it goes back to the same ideas:
However, I think one major point that Openness has that Agreeability doesn’t is that taking a Big Five personality quiz inherently means that you’re going to be more inquisitive. If you weren’t curious about it in the first place, you just… wouldn’t take the quiz or care about the result. Hence, there’s some sampling bias, as the people you’re getting results from are already skewed towards inquisitiveness. In addition, psychology is seen as something ‘nerdy’ and scientific, which caters to ‘smarter’ or ‘more educated’ audiences.
Hence, someone might agree with “I have a rich vocabulary”, not because they’re interested in learning new words, but just because they perceive themselves as having a lot of knowledge due to their education. Or answer a 1 (disagree) to “I have difficulty understanding abstract ideas” or a 5 (agree) to “I am full of ideas” because, through education, they’ve trained themselves to be better at comprehending abstract ideas or becoming better at brainstorming.
Sheer Statistical Impossibility…?
However, when analyzing this data, it seems… strange that so many XXOAI types are represented. In fact, it feels weird that 15%(!!!!) of people were the EXACT SAME TYPE, despite ALL THE POSSIBILITIES!
So, let’s see how it compares to other, more theoretical, data. Using the data3 from SimilarMinds, we can see how our data stacks up.
Looking at these tables, there’s a clear discrepancy between theoretical and experimental data, where none of the theoretical results really match up to what is seen in our data set. However, most of it can be explained by the previous analysis: Wrong results might occur because of bias and societal norms (cementing the last two letters to be A and I), in addition to the natural disposition of responders. This comparison really just showcases how there’s a high likelihood of bias.
Honestly, I’m not too sure why 15% of people are SCOAI. Maybe people are being influenced to answer what they WANT to be like, rather than what they actually are, since SCOAI seems like one of the most ‘socially-successful’ results. But, in reality, responders aren’t actually SCOAIs. That’s my best guess!4
Also, if you’re interested, here’s a list of the most uncommon results!
| finalletters | total |
|---|---|
| SLXXX | 1 |
| XLUXX | 1 |
| XLXXX | 1 |
| XXOXX | 1 |
| XXUXN | 1 |
| XXXXN | 1 |
| RXXEX | 2 |
| RXXXX | 2 |
| SXXEN | 2 |
| SXXXN | 2 |
No surprise, it’s a lot of results with X’s. It’s pretty hard to get an X, because you’re perfectly in between!
At first, I was surprised that ‘XXXXX’ didn’t appear, since it’s technically the most unlikely.5 However, I wouldn’t be surprised if only 5% of those results were genuine user responses, and the other 95% were just people who kept clicking 3 (Neutral), or had some kind of game to see if they could perfectly answer the questions to get XXXXX.
The neuroticism table was actually calculated differently. For the other 4 traits, there were an equal number of ‘positive (e.g. extroversion) questions’ and ‘negative (e.g. introversion) questions.’ Neuroticism had 2 ‘positive’ questions and 8 ‘negative’ questions. So, I subtracted 3 from each answer, where a score of 5 (Agree) became +2, and a score of 1 (Disagree) became -2. This helped balance the questions, while using a similar process. I tried using this method with the other tables and got the same values as my previous method, which is good to see!↩︎
I use agreeableness pretty synonymously with niceness. One could argue that this isn’t the case, and that agreeability and niceness are more distinct than I make them out to be. That can absolutely be the case, in which case the story is not relevant. However, the point about society preferring agreeability still stands, and society will still influence how you act and how you perceive yourself.↩︎
This was the only site that had theoretical values. I don’t know where they got their percentages from, and there was also no data on certain combinations. It should also be noted that this site does not use ‘X’ as a possible result, e.g. XLUEI (or any combination with an ‘X’) is not considered. So this source cannot be completely trusted. It should also be noted that SimilarMinds separated the theoretical values by female and male. Since this data set does not have this distinction, I used the average of the male and female theoretical values.↩︎
If you have more ideas, I’d love to hear them! Reach out :)↩︎
There are actually 3794 results for XXXXX.↩︎
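As a small sketch of the neuroticism footnote above: subtracting 3 from every answer only matters when the ‘positive’ and ‘negative’ questions are unbalanced. The Python below is an illustration, not the original code. The function names are made up, and the positions of the 2 ‘positive’ questions in `SKEWED_KEY` are an assumption (the footnote only gives the 2 vs 8 split).

```python
def raw_score(answers, key):
    """Add 'positive' answers and subtract 'negative' ones, as for the other traits."""
    return sum(sign * answer for sign, answer in zip(key, answers))


def centered_score(answers, key):
    """Same idea, but shift each answer by 3 first, so 1..5 becomes -2..+2."""
    return sum(sign * (answer - 3) for sign, answer in zip(key, answers))


# 5 'positive' and 5 'negative' questions, like the other four traits.
BALANCED_KEY = [+1, -1] * 5
# 2 'positive' and 8 'negative' questions, like neuroticism.
# Which two questions are 'positive' is an assumption.
SKEWED_KEY = [+1, +1] + [-1] * 8

all_neutral = [3] * 10  # someone who clicks 3 (Neutral) every time

print(raw_score(all_neutral, BALANCED_KEY), centered_score(all_neutral, BALANCED_KEY))
print(raw_score(all_neutral, SKEWED_KEY), centered_score(all_neutral, SKEWED_KEY))
```

With a balanced key, the shifts cancel out (five +3s and five -3s), so both methods always agree. With the 2 vs 8 key, the raw method gives a fully neutral person 6 - 24 = -18, while the centered method gives them 0, which is what an ‘X’ should look like.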